Project

Background & Context

Thera Bank recently saw a steep decline in the number of users of its credit card. Credit cards are a good source of income for banks because of the various fees they charge, such as annual fees, balance transfer fees, cash advance fees, late payment fees, and foreign transaction fees. Some fees are charged to every user irrespective of usage, while others are charged only under specified circumstances.

Customers leaving the credit card service would mean a loss for the bank, so the bank wants to analyze its customer data, identify the customers who are likely to leave, and understand the reasons, so that it can improve in those areas.

As a data scientist at Thera Bank, you need to come up with a classification model that will help the bank improve its services so that customers do not renounce their credit cards.

You need to identify the best possible model that will give the required performance.

Objective

• Explore and visualize the dataset.

• Build a classification model to predict whether the customer is going to churn or not.

• Optimize the model using appropriate techniques.

• Generate a set of insights and recommendations that will help the bank.

Data Dictionary:

CLIENTNUM: Client number. Unique identifier for the customer holding the account

Attrition_Flag: Internal event (customer activity) variable - if the account is closed then "Attrited Customer" else "Existing Customer"

Customer_Age: Age in Years

Gender: Gender of the account holder

Dependent_count: Number of dependents

Education_Level: Educational Qualification of the account holder - Graduate, High School, Unknown, Uneducated, College (refers to a college student), Post-Graduate, Doctorate.

Marital_Status: Marital Status of the account holder

Income_Category: Annual Income Category of the account holder

Card_Category: Type of Card

Months_on_book: Period of relationship with the bank

Total_Relationship_Count: Total no. of products held by the customer

Months_Inactive_12_mon: No. of months inactive in the last 12 months

Contacts_Count_12_mon: No. of Contacts between the customer and bank in the last 12 months

Credit_Limit: Credit Limit on the Credit Card

Total_Revolving_Bal: The balance that carries over from one month to the next is the revolving balance

Avg_Open_To_Buy: Open to Buy refers to the amount left on the credit card to use (Average of last 12 months)

Total_Trans_Amt: Total Transaction Amount (Last 12 months)

Total_Trans_Ct: Total Transaction Count (Last 12 months)

Total_Ct_Chng_Q4_Q1: Ratio of the total transaction count in 4th quarter and the total transaction count in 1st quarter

Total_Amt_Chng_Q4_Q1: Ratio of the total transaction amount in 4th quarter and the total transaction amount in 1st quarter

Avg_Utilization_Ratio: Represents how much of the available credit the customer spent

Explore the dataset and extract insights using Exploratory Data Analysis

• There are no missing values

• CLIENTNUM is a unique identifier and can be dropped.

Observations

• Most of the customers surveyed are current customers with an open account.

• The largest group of customers holds a graduate degree.

• Most of the customers are married.

• Most of the customers make less than $40K a year. This seems odd, given how many of them hold graduate degrees.

• Most customers are Blue card members.

Observation on Credit Limit

• The average credit limit of a customer is about $8,500. The credit limit distribution is right skewed.
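A quick way to back up a "right skewed" reading of a histogram is to check that the mean sits above the median and that the skewness coefficient is positive. The sketch below does this with the standard library only; the credit limits in it are made-up numbers for illustration, not values from the Thera bank data.

```python
from statistics import mean, median, pstdev


def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    mu = mean(values)
    sigma = pstdev(values)
    return sum(((v - mu) / sigma) ** 3 for v in values) / len(values)


# Made-up credit limits (illustration only, not the real dataset).
credit_limits = [1500, 2000, 2500, 3000, 3500, 4000, 9000, 20000, 34000]

skew = skewness(credit_limits)

# A long right tail pulls the mean above the median and makes skewness positive.
is_right_skewed = skew > 0 and mean(credit_limits) > median(credit_limits)
print(f"mean={mean(credit_limits):.0f} median={median(credit_limits)} skew={skew:.2f}")
```

With a pandas DataFrame the same check is usually a one-liner on the column, but the idea is the same.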

Observations on Months on Book

• On average, customers have been with the bank for about 35 months.

• Female and male customers are more or less equal in number (with M < F).

• Most customers make less than $40K a year; the second-largest income group is $40K-$60K.

• 84% are existing customers

• 31% of customers hold graduate degrees; high school is the second most common education level.

• Utilization is right skewed

• Customers have 2 to 3 dependents
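The 84% share of existing customers matters for modelling: with classes this imbalanced, accuracy alone can look good while the model misses every attrited customer. The sketch below uses a hypothetical 100-customer sample with the same 84/16 split and a "model" that always predicts the majority class. The metric functions are written out by hand here as an illustration; they are not the notebook's original code.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive):
    """Share of actual positives that the model caught."""
    actual_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    return sum(p == positive for p in actual_positives) / len(actual_positives)


# Hypothetical sample mirroring the 84% / 16% split in the observations.
y_true = ["Existing Customer"] * 84 + ["Attrited Customer"] * 16

# A naive baseline that predicts "Existing Customer" for everyone.
y_pred = ["Existing Customer"] * 100

baseline_accuracy = accuracy(y_true, y_pred)
baseline_recall = recall(y_true, y_pred, positive="Attrited Customer")
print(baseline_accuracy, baseline_recall)
```

This is why recall on the attrited class is usually the more informative metric for a churn problem like this one.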

Bivariate Analysis

• Avg_Open_To_Buy and Credit_Limit are almost perfectly correlated (close to a 1:1 relationship), so I will drop one of them.

• Months on Book is highly correlated with Age.

• Transaction amount and transaction count are also correlated.
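The correlation checks above can be sketched as a small routine that computes the Pearson correlation for every pair of numeric columns and flags pairs above a threshold. The column names come from the data dictionary, but the values and the 0.9 threshold are assumptions made for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def highly_correlated_pairs(columns, threshold=0.9):
    """Return (col_a, col_b, r) for every pair with |r| >= threshold."""
    pairs = []
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        if abs(r) >= threshold:
            pairs.append((a, b, round(r, 3)))
    return pairs


# Toy values (not the real data), keyed by real column names.
columns = {
    "Credit_Limit": [2000, 5000, 8000, 12000, 30000],
    "Avg_Open_To_Buy": [1200, 4100, 6500, 11000, 28500],
    "Customer_Age": [45, 31, 52, 38, 60],
}

pairs = highly_correlated_pairs(columns)
print(pairs)
```

For each flagged pair, keeping only one of the two columns avoids feeding the model redundant information, which is especially helpful for logistic regression.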

Logistic Regression

Predictions and Evaluations

Decision Tree

Predictions and Evaluations

Random Forest

Predictions and Evaluations

Bagging Classifier

Gradient Boosting Classifier

XGBoost Classifier

Grid Search

SMOTE
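SMOTE balances the classes by creating synthetic minority samples: it picks a minority point, picks one of its nearest minority neighbours, and places a new point somewhere on the line between them. The notebook's own code is not shown here (a library such as imbalanced-learn is the usual choice), so the function below is only a simplified, hypothetical sketch of the interpolation idea using toy two-feature points.

```python
import random
from math import dist


def smote_like_samples(minority, n_new, k=2, seed=1):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy minority-class points (two scaled features each), illustration only.
minority = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.2), (3.0, 3.0)]
majority_size = 10  # assumed size of the majority class

# Generate just enough synthetic points to match the majority class.
new_points = smote_like_samples(minority, n_new=majority_size - len(minority))
print(len(minority) + len(new_points))
```

Oversampling should be applied to the training split only, so that the test set still reflects the real class balance.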

Logistic Regression on SMOTE over sampled data

• Performance of the model on the training set varies between 0.80 and 0.83, which is not an improvement over the previous model.

• Let's check the performance on the test set.

Logistic Regression on undersampled data
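Undersampling takes the opposite route to SMOTE: instead of adding minority samples, it randomly discards majority samples until the classes are the same size. A minimal sketch follows, assuming each customer is a dict with an `Attrition_Flag` key; the helper name `random_undersample` and the toy rows are hypothetical.

```python
import random


def random_undersample(rows, label_key, seed=1):
    """Randomly drop rows from the larger classes until every class has
    as many rows as the smallest one."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)
    target = min(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, target))
    rng.shuffle(balanced)  # avoid leaving the rows grouped by class
    return balanced


# Toy training rows with the 84 / 16 split seen in the data.
train_rows = (
    [{"id": i, "Attrition_Flag": "Existing Customer"} for i in range(84)]
    + [{"id": 84 + i, "Attrition_Flag": "Attrited Customer"} for i in range(16)]
)

balanced_rows = random_undersample(train_rows, "Attrition_Flag")
counts = {}
for row in balanced_rows:
    counts[row["Attrition_Flag"]] = counts.get(row["Attrition_Flag"], 0) + 1
print(counts)
```

The trade-off is that most of the majority-class rows are thrown away, which is worth keeping in mind when comparing this model with the oversampled one.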

Finding Coefficients

Converting coefficients to Odds
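A logistic regression coefficient is a change in log-odds, so exponentiating it gives an odds ratio, and `(exp(coef) - 1) * 100` gives the percentage change in the odds for a one-unit increase in that feature. The coefficient values below are illustrative assumptions, not the fitted values from this model.

```python
from math import exp


def odds_ratio(coef):
    """Multiplicative change in the odds per one-unit increase."""
    return exp(coef)


def percent_change_in_odds(coef):
    """Percentage change in the odds per one-unit increase."""
    return (exp(coef) - 1) * 100


# Illustrative coefficients only (NOT the values fitted in this project).
coefficients = {
    "Contacts_Count_12_mon": 0.42,
    "Months_Inactive_12_mon": 0.40,
    "Total_Trans_Ct": -0.10,
}

for feature, coef in coefficients.items():
    print(
        f"{feature}: odds ratio {odds_ratio(coef):.2f}, "
        f"{percent_change_in_odds(coef):+.0f}% change in odds"
    )
```

A positive coefficient raises the odds of attrition and a negative one lowers them. Note that the percentage applies to the odds, not directly to the probability.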

Missing value treatment

Building the models as - ver2

Hyperparameter tuning

We will tune GBM, AdaBoost, and XGBoost and see whether the performance improves.

GBM

GridSearchCV

Observations

RandomizedSearchCV

AdaBoost

GridSearchCV

Observations

RandomizedSearchCV


---- Removed XGBOOST ----

XGBoost was removed from the tuned models because the search did not complete even on a 56-core machine. It was computationally too expensive.
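To see why an exhaustive grid search can become too expensive, it helps to count the model fits it requires: the product of the number of values per hyperparameter, times the number of cross-validation folds. The grid actually used for XGBoost is not shown, so the one below is hypothetical, as are the 5 folds and the 50 randomized iterations.

```python
from math import prod


def grid_search_fits(param_grid, cv_folds):
    """Model fits needed by an exhaustive grid search."""
    return prod(len(values) for values in param_grid.values()) * cv_folds


def randomized_search_fits(n_iter, cv_folds):
    """Model fits needed by a randomized search with n_iter candidates."""
    return n_iter * cv_folds


# Hypothetical XGBoost grid (the notebook's real grid is not shown).
xgb_grid = {
    "n_estimators": [50, 100, 150, 200, 250],
    "learning_rate": [0.01, 0.05, 0.1, 0.2],
    "max_depth": [3, 5, 7],
    "subsample": [0.7, 0.8, 0.9, 1.0],
    "gamma": [0, 1, 3, 5],
}

grid_fits = grid_search_fits(xgb_grid, cv_folds=5)
random_fits = randomized_search_fits(n_iter=50, cv_folds=5)
print(grid_fits, random_fits)
```

Each extra hyperparameter multiplies the cost of the grid, while a randomized search keeps the budget fixed. Shrinking the grid or using a randomized search is a common way to get a boosted model back into the comparison.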

Comparing all models

Performance on test set

Pipelines for production model

Conclusions and Business Insights

• GBM performed best on the test data.

• Top three factors that affect credit card attrition: Total Transaction Amount, Total Transaction Count, and Total Revolving Balance. So, in short, those who keep their account use their credit cards a lot.

• The business should focus on getting customers to use the credit they have more often. Card usage seems to be the best predictor of keeping a customer.

• Over 30% of customers have graduate degrees and make less than $40K a year.

Attrited Client Profile:

• Clients who are contacted a lot are 52% more likely to leave.

• Clients who are inactive during a 12-month period are 49% more likely to leave.

• Clients who have a smaller credit limit.

• Clients who have a 50% smaller revolving balance.